5 Probability
This chapter is a draft. There are errors.
The usual touchstone of whether what someone asserts is mere persuasion or at least a subjective conviction, i.e., firm belief, is betting. Often someone pronounces his propositions with such confident and inflexible defiance that he seems to have entirely laid aside all concern for error. A bet disconcerts him. Sometimes he reveals that he is persuaded enough for one ducat but not for ten. For he would happily bet one, but at ten he suddenly becomes aware of what he had not previously noticed, namely that it is quite possible that he has erred. (Immanuel Kant, Critique of Pure Reason)
The central tension, and opportunity, in data science is the interplay between the data and the science, between our empirical observations and the models which we use to understand them. Probability is the language we use to explore that interplay; it connects models to data, and data to models.
What does it mean that Trump had a 30% chance of winning the election in the fall of 2016? That there is a 90% probability of rain today? That the dice at the casino are fair?
Probability is about quantifying uncertainty. Think of probability as a proportion. The probability of an event occurring is a number from 0 to 1, where 0 means that the event is impossible and 1 means that the event is 100% certain.
Begin with the simplest events: coin flips and dice rolls. The set of all outcomes is the sample space. With fair coins and dice, we know that:
- The probability of rolling a 1 or a 2 is 2/6, or 1/3.
- The probability of rolling a 1, 2, 3, 4, 5, or 6 is 1.
- The probability of flipping a coin and getting tails is 1/2.
If the probability of an outcome is unknown, we will often refer to it as an unknown parameter, something which we might use data to estimate. We usually use Greek letters to refer to parameters. Whenever we are talking about a specific probability (represented by a single value), we will use \(\rho\) (the Greek letter “rho” but spoken aloud as “p” by us) with a subscript which specifies the exact outcome of which it is the probability. For instance, \(\rho_h = 0.5\) denotes the probability of getting heads on a coin toss when the coin is fair. \(\rho_t\) — spoken as “PT” or “P sub T” or “P tails” — denotes the probability of getting tails on a coin toss. This notation can become annoying if the outcome whose probability we seek is not so easy to define. For example, we might write the probability of rolling a 1, 2 or 3 using a fair die as:
\[ \rho_{die\ roll\ is\ 1,\ 2\ or\ 3} = 0.5 \]
We will rarely write out the full definition of the event with the \(\rho\) symbol. Instead, we will define an event A as when a rolled die equals 1, 2 or 3 and, then, write
\[\rho_A = 0.5\].
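As a small illustration of this notation, the sketch below computes \(\rho_A\) by listing the sample space for a fair die and measuring the proportion of outcomes which belong to event A. The variable names (`sample_space`, `in_A`, `rho_A`) are ours, chosen for this example only, and the calculation assumes that all six outcomes are equally likely.

```r
# Sample space for one roll of a fair die: six equally likely outcomes.
sample_space <- 1:6

# Event A: the roll is 1, 2 or 3.
in_A <- sample_space %in% c(1, 2, 3)

# With equally likely outcomes, the probability of A is the proportion of
# outcomes in the sample space which belong to A.
rho_A <- mean(in_A)
rho_A
```

The same approach works for any event defined on a finite sample space with equally likely outcomes. It does not work if the die is loaded.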
A random variable is a function which produces a value from a sample space. A random variable can be either discrete, where the sample space has a limited number of members (like H or T for the result of a coin flip, or 2, 3, …, 12 for the sum of two dice), or continuous, where it can take any value in a continuous range. Probability is a claim about the value of a random variable, e.g., that you have a 50% probability of getting a 1, 2 or 3 when you roll a fair die.
We usually use capital letters for random variables. So, \(C\) might be our symbol for the random variable which is a coin toss and \(D\) might be our symbol for the random variable which is the sum of two dice. When discussing random variables in general, or when we grow tired of coming up with new symbols, we will use \(Y\).
Small letters refer to a single outcome or result from a random variable. \(c\) is the outcome from one coin toss. \(d\) is the result from one throw of the dice. The value of the outcome must come from the sample space. So, \(c\) can only take on two possible values: heads or tails. When discussing random variables in general, we use \(y\) to refer to one outcome of the random variable \(Y\). If there are multiple outcomes (if we have, for example, flipped the coin multiple times), then we use subscripts to indicate the separate outcomes: \(y_1\), \(y_2\), and so on. The symbol for an arbitrary outcome is \(y_i\).
The only package we need in this chapter is tidyverse.
To understand probability more fully, we first need to understand distributions.
5.1 Distributions
A variable in a tibble is a column, a vector of values. We sometimes refer to this vector as a “distribution.” This is somewhat sloppy in that a distribution can be many things, most commonly a mathematical formula. But, strictly speaking, a “frequency distribution” or an “empirical distribution” is a list of values, so this usage is not unreasonable.
5.1.1 Scaling distributions
Consider the vector which is the result of rolling one die 10 times.
ten_rolls <- c(5, 5, 1, 5, 4, 2, 6, 2, 1, 5)
There are other ways of storing the data in this vector. Instead of reporting every observation, we could just record the number of times each value appears.
Distribution of Ten Rolls of a Fair Die: counts and percentages reflect the same information.

| Outcome | Count | Percentage |
|---|---|---|
| 1 | 2 | 20% |
| 2 | 2 | 20% |
| 4 | 1 | 10% |
| 5 | 4 | 40% |
| 6 | 1 | 10% |
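The counts and percentages in the table above can be reproduced directly from the vector. This is a minimal sketch using base R's `table()` and `prop.table()`; the names `counts` and `percentages` are ours.

```r
# The ten rolls from the text.
ten_rolls <- c(5, 5, 1, 5, 4, 2, 6, 2, 1, 5)

# Number of times each outcome appears. Outcomes which never occur (here, 3)
# do not show up in the table.
counts <- table(ten_rolls)

# The same information expressed as percentages of all rolls.
percentages <- 100 * prop.table(counts)

counts
percentages
```

Note that an outcome with a count of zero is simply absent, which is why the table has five rows rather than six.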
In this case, with only 10 values, it is actually less efficient to store the data like this. But what happens when we have 1,000 rolls?
Distribution of One Thousand Rolls of a Fair Die: counts and percentages reflect the same information.

| Outcome | Count | Percentage |
|---|---|---|
| 1 | 190 | 19% |
| 2 | 138 | 14% |
| 3 | 160 | 16% |
| 4 | 173 | 17% |
| 5 | 169 | 17% |
| 6 | 170 | 17% |
Instead of keeping around a vector of length 1,000, we can just keep 12 values (the 6 possible outcomes and their counts) without losing any information about how often each outcome occurred.
Two distributions can be identical even if they are of very different lengths. Let’s compare our original distribution of 10 rolls of the die with another distribution which just features 100 copies of those 10 rolls.
more_rolls <- rep(ten_rolls, 100)
The two graphs have the exact same shape because, even though the vectors are of different lengths, the relative proportions of the outcomes are identical. In some sense, both vectors are from the same distribution. The total count for each value does not matter. What matters is the relative proportions.
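The claim that the two vectors share the same relative proportions can be checked numerically. The sketch below uses base R only; `prop_ten` and `prop_more` are names we made up for this example.

```r
# The original ten rolls and 100 copies of them, as in the text.
ten_rolls <- c(5, 5, 1, 5, 4, 2, 6, 2, 1, 5)
more_rolls <- rep(ten_rolls, 100)

# Relative proportion of each outcome in each vector.
prop_ten <- prop.table(table(ten_rolls))
prop_more <- prop.table(table(more_rolls))

# The lengths differ by a factor of 100, but the proportions match.
length(ten_rolls)
length(more_rolls)
all.equal(as.numeric(prop_ten), as.numeric(prop_more))
```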
5.1.2 Normalizing distributions
If two distributions have the same shape, then they only differ by the labels on the y-axis. There are various ways of “normalizing” distributions to make them all the same scale. The most common scale is one in which the area under the distribution adds to 1, i.e., 100%. For example, we can transform the plots above to look like:

We sometimes refer to a distribution as “unnormalized” if the area under the curve does not add up to 1.
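For a discrete distribution, normalizing amounts to dividing each count by the total. The sketch below applies this to the counts from the table of one thousand rolls shown earlier. The names `counts`, `normalized` and `doubled` are ours.

```r
# Unnormalized distribution: counts from the table of one thousand rolls.
counts <- c(`1` = 190, `2` = 138, `3` = 160, `4` = 173, `5` = 169, `6` = 170)

# Divide by the total so that the values add up to 1.
normalized <- counts / sum(counts)
sum(normalized)

# Rescaling the counts changes the y-axis labels but not the normalized
# distribution.
doubled <- 2 * counts
all.equal(doubled / sum(doubled), normalized)
```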
5.1.3 Simulating distributions
There are two distinct concepts: a distribution and a set of values drawn from that distribution. But, in everyday use, we use “distribution” for both. When given a distribution (meaning a vector of numbers), we often use geom_histogram() or geom_density() to graph it. But, sometimes, we don’t want to look at the whole thing. We just want some summary measures which report the key aspects of the distribution. The two most important attributes of a distribution are its center and its variation around that center.
We use summarize() to calculate statistics for a variable, a column, a vector of values or a distribution. Note the language sloppiness. For the purposes of this book, “variable,” “column,” “vector,” and “distribution” all mean the same thing. Popular statistical functions include: mean(), median(), min(), max(), n() and sum(). Functions which may be new to you include three measures of the “spread” of a distribution: sd() (the standard deviation), mad() (the scaled median absolute deviation) and quantile(), which is used to calculate an interval which includes a specified proportion of the values.
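To make these functions concrete, here is a sketch which applies them directly to the ten die rolls from earlier in the chapter. We call the functions on the bare vector rather than inside summarize() so that the example runs without loading any packages; inside summarize() the calls would look the same. The 95% interval is just one common choice of proportion.

```r
ten_rolls <- c(5, 5, 1, 5, 4, 2, 6, 2, 1, 5)

# Measures of the center.
center_mean <- mean(ten_rolls)
center_median <- median(ten_rolls)

# Measures of the spread.
spread_sd <- sd(ten_rolls)
spread_mad <- mad(ten_rolls)

# An interval which includes (roughly) the middle 95% of the values.
interval_95 <- quantile(ten_rolls, probs = c(0.025, 0.975))

center_mean
center_median
spread_sd
spread_mad
interval_95
```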
Think of the distribution of a variable as an urn from which we can pull out, at random, values for that variable. Drawing a thousand or so values from that urn, and then looking at a histogram, can show where the values are centered and how they vary. Because people are sloppy, they will use the word distribution to refer to at least three related entities:
- the (imaginary!) urn from which we are drawing values.
- all the values in the urn
- all the values which we have drawn from the urn, whether that be 10 or 10,000
Sloppiness in the usage of the word distribution is universal. However, you must keep three distinct ideas separate:
- The unknown true distribution which, in reality, generates the data which we see. Outside of stylized examples in which we assume that a distribution follows a simple mathematical formula, we will never have access to the unknown true distribution. We can only estimate it. This unknown true distribution is often referred to as the data generating mechanism, or DGM. It is a function or black box or urn which produces data. We can see the data. We can’t see the urn.
- The estimated distribution which, we think, generates the data which we see. Again, we can never know the unknown true distribution. But, by making some assumptions and using the data we have, we can estimate a distribution. Our estimate may be very close to the true distribution. Or it may be far away. The main task of data science is to create and use these estimated distributions. Almost always, these distributions are instantiated in computer code. Just as there is a true data generating mechanism associated with the (unknown) true distribution, there is an estimated data generating mechanism associated with the estimated distribution.
- A vector of numbers drawn from the estimated distribution. Both true and estimated distributions can be complex animals, difficult to describe accurately and in detail. But a vector of numbers drawn from a distribution is easy to understand and use. So, in general, we work with vectors of numbers. When someone (either a colleague or a piece of R code) creates a distribution which we want to use to answer a question, we don’t really want the distribution itself. Rather, we want a vector of “draws” from that distribution. Vectors are easy to work with! Complex computer code is not.
Again, people (including us!) will often be sloppy and use the same word, “distribution,” without making it clear whether they are talking about the true distribution, the estimated distribution, or a vector of draws from the estimated distribution. The same sloppiness applies to the use of the term data generating mechanism. Try not to be sloppy.
Much of the rest of the Primer involves learning how to work with distributions, which generally means working with the draws from those distributions. Fortunately, the usual rules of arithmetic apply. You can add/subtract/multiply/divide distributions by working with draws from those distributions, just as you can add/subtract/multiply/divide regular numbers.
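As a sketch of what "arithmetic with distributions" means in practice, the code below adds two distributions (one fair die each) by adding their draws element by element. The seed, the number of draws and the variable names are arbitrary choices for this illustration.

```r
set.seed(9)
n <- 10000

# Draws from two distributions: one for each fair die.
die_1 <- sample(1:6, size = n, replace = TRUE)
die_2 <- sample(1:6, size = n, replace = TRUE)

# Adding the two vectors of draws gives draws from the distribution of the sum.
sum_draws <- die_1 + die_2

# Summaries of the new distribution come straight from the new vector.
mean(sum_draws)
range(sum_draws)
```

The result, `sum_draws`, is itself just a vector of draws, so it can be summarized, graphed or combined with other draws in exactly the same way.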
5.2 Probability distributions

FIGURE 5.1: Bruno de Finetti, an Italian statistician who wrote a famous treatise on the theory of probability that began with the statement “PROBABILITY DOES NOT EXIST.”
For the purposes of this Primer, a probability distribution is a mathematical object which maps a set of outcomes to probabilities, where each distinct outcome has a chance of occurring between 0 and 1 inclusive. The probabilities must sum to 1. The set of possible outcomes, i.e., the sample space (heads or tails for the coin, 1 through 6 for a single die, 2 through 12 for the sum of a pair of dice), can be either discrete or continuous. Remember, discrete data can only take on certain values. Continuous data, like height and weight, can take any value within a range. The set of outcomes is the domain of the probability distribution. The range is the associated probabilities.
Assume that a probability distribution is created by a probability function, a set function which maps outcomes to probabilities. The concept of a “probability function” is often split into two categories: probability mass functions (for discrete random variables) and probability density functions (for continuous random variables). As usual, we will be a bit sloppy, using the term probability distribution for both the mapping itself and for the function which creates the mapping.
We discuss three types of probability distributions: empirical, mathematical, and posterior.
The key difference between a distribution, as we have explored them in Section 5.1, and a probability distribution is the requirement that the sum of the probabilities of the individual outcomes must be exactly 1. There is no such requirement for a distribution in general. But any distribution can be turned into a probability distribution by “normalizing” it. In this context, we will often refer to a distribution which is not (yet) a probability distribution as an “unnormalized” distribution.
Pay attention to notation. Whenever we are talking about a specific probability (represented by a single value), we will use \(\rho\) (the Greek letter “rho” but spoken aloud as “p” by us) with a subscript which specifies the exact outcome of which it is the probability. For instance, \(\rho_h = 0.5\) denotes the probability of getting heads on a coin toss when the coin is fair. \(\rho_t\) — spoken as “PT” or “P sub T” or “P tails” — denotes the probability of getting tails on a coin toss. However, when we are referring to the entire probability distribution over a set of outcomes, we will use \(P()\). For example, the probability distribution of a coin toss is \(P(\text{coin})\). That is, \(P(\text{coin})\) is composed of the two specific probabilities (50% and 50%) mapped from the two values in the domain (Heads and Tails). Similarly, \(P(\text{sum of two dice})\) is the probability distribution over the set of 11 outcomes (2 through 12) which are possible when you take the sum of two dice. \(P(\text{sum of two dice})\) is made up of 11 numbers — \(\rho_2\), \(\rho_3\), …, \(\rho_{12}\) — each representing the unknown probability that the sum will equal their value. That is, \(\rho_2\) is the probability of rolling a 2.
5.2.1 Flipping a coin
All data science problems start with a question. Example: What will be the result of the next flip of a coin? All questions are answered with the help of probability distributions.
An empirical distribution is based on data. You can think of this as the probability distribution created by running a simulation. In theory, if we increase the number of coins we flip in our simulation, the empirical distribution will look more and more similar to the mathematical distribution. The mathematical distribution is the Platonic form. The empirical distribution will often look like the mathematical probability distribution, but it will rarely be exactly the same.
In this simulation, there are 56 heads and 44 tails. The outcome will vary every time we run the simulation, but the proportion of heads to tails should not be too different if this coin is fair.
library(tidyverse)

# We are flipping one fair coin a hundred times. We need to get the same result
# each time we create this graphic because we want the results to match the
# description in the text. Using set.seed() guarantees that the random results
# are the same each time. We define 0 as tails and 1 as heads.
set.seed(3)
tibble(results = sample(c(0, 1), 100, replace = TRUE)) %>%
  ggplot(aes(x = results)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   binwidth = 0.5,
                   color = "white") +
    labs(title = "Empirical Probability Distribution",
         subtitle = "Flipping one coin a hundred times",
         x = "Outcome\nResult of Coin Flip",
         y = "Probability") +
    # The labels follow the coding above: 0 is tails and 1 is heads.
    scale_x_continuous(breaks = c(0, 1),
                       labels = c("Tails", "Heads")) +
    scale_y_continuous(labels =
                         scales::percent_format(accuracy = 1)) +
    theme_classic()
A mathematical distribution is based on a mathematical formula. Assuming that the coin is perfectly fair, we should, on average, get heads as often as we get tails.

The distribution of a single observation is described by this formula.
\[ P(Y = y) = \begin{cases} 1/2 &\text{for }y= \text{Heads}\\ 1/2 &\text{for }y= \text{Tails} \end{cases}\]
We could also write this as \(y_i \sim \text{Bernoulli}(\rho_h)\), with \(\rho_h = 0.5\): each outcome \(y_i\) is a draw from this mathematical probability distribution. If the mathematical assumptions are correct, then as your sample size increases, the empirical probability distribution will look more and more like the mathematical distribution.
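The convergence of the empirical distribution toward the mathematical one can be seen in a short simulation. This sketch assumes a fair coin (probability of heads 0.5) and uses rbinom() with `size = 1`, which produces Bernoulli draws coded as 0 (tails) and 1 (heads). The seed and sample sizes are arbitrary.

```r
set.seed(5)
rho_h <- 0.5

# A small and a large number of flips of the same fair coin.
flips_small <- rbinom(n = 100, size = 1, prob = rho_h)
flips_large <- rbinom(n = 100000, size = 1, prob = rho_h)

# The proportion of heads in each sample. The larger sample should, in
# general, land closer to 0.5.
mean(flips_small)
mean(flips_large)
```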
A posterior distribution is based on beliefs and expectations. It displays your belief about things you can’t see right now. You may have posterior distributions for outcomes in the past, present, or future.
In the case of the coin toss, the posterior distribution changes depending on your beliefs. For instance, let’s say your friend brought a coin to school and asked to bet you. If the result is heads, you have to pay them $5.
This makes you suspicious. In other words, you no longer trust the “population” behind the previous two examples, in which every result and every probability was based on flipping a fair coin. If you don’t just walk away, all you have left is your posterior distribution. Your posterior distribution reflects your beliefs. This time, the population is no longer the fair coin, for which we defined \(\rho_h = 0.5\). It is the population of “crooked” coins, for which you might believe that \(\rho_h\) is 0.95 and \(\rho_t\) is 0.05.

The full terminology is mathematical (or empirical or posterior) probability distribution. But we will often shorten this to just mathematical (or empirical or posterior) distribution. The word “probability” is understood, even if it is not present.
Population is a key concept in data analysis, and we will use it throughout later chapters. The population is essentially the imaginary urn from which our data has been, or will be, drawn.
5.2.2 Rolling two dice
We get an empirical distribution by rolling two dice a hundred times, either by hand or with a computer simulation. The result is not identical to the mathematical distribution because of the inherent randomness of the real world and/or of simulation.
# In the coin example, we created the vector ahead of time and then assigned
# that vector to a tibble. There was nothing wrong with that approach. And we
# could do the same thing here. But the use of map_* functions is more powerful,
# although it requires creating the 100 rows of the tibble at the start and then
# doing things "row-by-row."
set.seed(1)
emp_dist_dice <- tibble(ID = 1:100) %>%
  mutate(die_1 = map_dbl(ID, ~ sample(c(1:6), size = 1))) %>%
  mutate(die_2 = map_dbl(ID, ~ sample(c(1:6), size = 1))) %>%
  mutate(sum = die_1 + die_2) %>%
  ggplot(aes(x = sum)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   binwidth = 1,
                   color = "white") +
    labs(title = "Empirical Probability Distribution",
         subtitle = "Sum from rolling two dice, replicated one hundred times",
         x = "Outcome\nSum of Two Dice",
         y = "Probability") +
    scale_x_continuous(breaks = seq(2, 12, 1), labels = 2:12) +
    scale_y_continuous(labels =
                         scales::percent_format(accuracy = 1)) +
    theme_classic()
emp_dist_dice
We might consider labeling the y-axis in plots of empirical distributions as “Proportion” rather than “Probability” since it is an actual proportion, calculated from real (or simulated) data. We will keep it as “Probability” since we want to emphasize the parallels between mathematical, empirical and posterior probability distributions.
Our mathematical distribution tells us that, with a fair die, the probabilities of getting 1, 2, 3, 4, 5, and 6 are equal: there is a 1/6 chance of each. When we roll two dice at the same time and sum the numbers, the values closest to the middle are more common than values at the edge because there are more combinations of numbers that add up to the middle values.

\[ P(Y = y) = \begin{cases} \dfrac{y-1}{36} &\text{for }y=2,3,4,5,6 \\ \dfrac{13-y}{36} &\text{for }y=7,8,9,10,11,12 \\ 0 &\text{otherwise} \end{cases} \]
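The formula can be checked by listing all 36 equally likely combinations of two fair dice and counting how many produce each sum. This is a sketch using base R; `all_sums`, `exact` and `formula_prob` are names we chose for the example.

```r
# Every combination of two dice: a 6 x 6 matrix holding the 36 possible sums.
all_sums <- outer(1:6, 1:6, FUN = "+")

# Each combination has probability 1/36, so the probability of a sum is the
# number of combinations which produce it, divided by 36.
exact <- table(all_sums) / 36
exact

# The piecewise formula, for sums from 2 through 12. Both branches agree
# at y = 7, so either could be used there.
formula_prob <- function(y) ifelse(y <= 7, (y - 1) / 36, (13 - y) / 36)

all.equal(as.numeric(exact), formula_prob(2:12))
sum(exact)
```

Counting combinations also shows why 7 is the most likely sum: six of the 36 combinations produce it, more than for any other value.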
The posterior distribution for rolling two dice a hundred times depends on your beliefs. If you take the dice from your Monopoly set, you have reason to believe that the assumptions underlying the mathematical distribution are true. However, if you walk into a crooked casino and a host asks you to play craps, you might be suspicious. Just as in the coin-flipping example, “suspicious” means that you no longer trust the “population” from which the mathematical and empirical distributions draw their data. For example, in craps, a come-out roll of 7 or 11 is a “natural,” resulting in a win for the “shooter” and a loss for the casino. You might expect those numbers to occur less often than they would with fair dice. Meanwhile, a come-out roll of 2, 3 or 12 is a loss for the shooter. You might expect values like 2, 3 and 12 to occur more frequently. Your posterior distribution might look like this:

Someone less suspicious of the casino would have a posterior distribution which looks more like the mathematical distribution.
Recall that the population is the imaginary urn from which our data has been, or will be, drawn. A perfect probability distribution would come from pouring out all the data in the population (the urn) and counting it. This is impossible because the population is infinite, or at least very large. For two dice, for example, the population is the result of rolling two dice an infinite number of times and calculating the sum each time. Because we can never get access to all the data, we need the mathematical and the empirical distributions. The posterior distribution is different because we are essentially switching to another population (urn), the one which we believe generates the data.
5.2.3 Presidential elections
Now let’s say we are building probability distributions for political events, like a presidential election. We want to know the probability that the Democratic candidate wins X electoral votes, where X comes from the range of possible outcomes: 0 to 538. (The total number of electoral votes in US elections since 1964 is 538.)
The empirical distribution in this case could involve looking into past elections in the United States and counting the number of electoral votes that the Democrats won in each. For the empirical distribution, we create a tibble with electoral vote results from past elections. Looking at elections since 1964, we can observe that the number of electoral votes that the Democrats received in each one is different. Given that we only have 15 entries, it is difficult to draw conclusions or make predictions based off of this empirical distribution.
However, this empirical distribution is enough to suggest that the assumptions of the mathematical probability distribution shown below do not work for electoral votes. That model assumes that the Democrats have a 50% chance of receiving each of the 538 votes. Just looking at the mathematical probability distribution, we can observe that receiving 13 or 17 or 486 votes out of 538 would be extreme and almost impossible under this mathematical model. However, our empirical distribution tells us that those were real election results.

We can build a mathematical distribution for X which assumes that the chance of the Democratic candidate winning any given state’s electoral votes is 0.5 and that the results from each state are independent.

If the assumptions behind this mathematical distribution were correct (they are not), then, as the sample size increased, the empirical distribution would look more and more similar to our mathematical distribution.
We know that campaign platforms, donations, charisma, and many other factors will contribute to a candidate’s success. Elections are more complicated than coin tosses. We also know that many presidential elections in history have resulted in much bigger victories or defeats than this distribution seems to allow for.
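To get a rough sense of how little room such a model leaves for landslides, the sketch below uses the simplest version of the assumption mentioned above: each of the 538 electoral votes is won independently with probability 0.5, so the total follows a binomial distribution. This vote-level simplification is ours, made for illustration. A state-level model, in which votes move in blocks, would be somewhat more spread out.

```r
total_votes <- 538

# Probability of winning 13 or fewer electoral votes under this model.
prob_13_or_fewer <- pbinom(13, size = total_votes, prob = 0.5)

# Probability of winning 486 or more electoral votes under this model.
prob_486_or_more <- pbinom(485, size = total_votes, prob = 0.5,
                           lower.tail = FALSE)

# The central 95% of outcomes which the model considers plausible.
central_95 <- qbinom(c(0.025, 0.975), size = total_votes, prob = 0.5)

prob_13_or_fewer
prob_486_or_more
central_95
```

Both tail probabilities are vanishingly small, which is the numerical version of the point made in the text: results like 13 or 486 electoral votes are essentially impossible under this model, even though they have actually happened.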
The posterior distribution of electoral votes is a popular topic, and an area of strong disagreement, among data scientists. Consider this posterior from FiveThirtyEight.

Here is a posterior from the FiveThirtyEight website from August 13, 2020. This was created using the same data as the above distribution, but simply displayed differently. For each electoral result, the height of the bar represents the probability that a given event will occur. However, there are no labels on the y-axis telling us what the specific probability of each outcome is. And that is OK! The specific values are not that useful. If we removed the labels on our y-axes, would it matter?

Here is the posterior from The Economist, also from August 13, 2020. This looks confusing at first because they chose to merge the axes for Republican and Democratic electoral votes. We can tell that The Economist was less optimistic, relative to FiveThirtyEight, about Trump’s chances in the election.

These two models, built by smart people using similar data sources, have reached fairly different conclusions. Data science is difficult! There is not one “right” answer. Real life is not a problem set.
 and [here](https://statmodeling.stat.columbia.edu/2020/08/31/problem-of-the-between-state-correlations-in-the-fivethirtyeight-election-forecast/).](05-probability/images/538_versus_Economist.png)
FIGURE 5.2: Watch the makers of these two models throw shade at each other on Twitter! Eliot Morris is one of the primary authors of the Economist model. Nate Silver is in charge of 538. They don’t seem to be too impressed with each other’s work! More smack talk here and here.
There are many political science questions you could explore with posterior distributions. They can relate to the past, present, or future.
- Past: How many electoral votes would Hillary Clinton have won if she had picked a different VP?
- Present: What are the total campaign donations from Harvard faculty?
- Future: How many electoral votes will the Democratic candidate for president win in 2024?
The concept of population is key to this topic, because if we assume that the population of the past is the same as, or very similar to, the population of the future, then we can predict the future based on past performance. Let’s now analyze the three different probability distributions for presidential elections.
The mathematical distribution says that, if we know the data generating mechanism (DGM), we can produce the data on our own. In this case, our DGM assumes that, in all 50 states, the Democrats and Republicans have exactly equal chances of winning.
The empirical distribution says that, since it is impossible to get the data for all presidential election results, past and future, we can look at the results of past elections and assume that they are representative of all elections.
The posterior distribution says that we don’t believe what the other two are saying. In fact, we believe that the current population is unreliable. So, instead, we create a new population, a new imaginary urn, which we think is trustworthy, and use the data in that new population to graph our probability distribution.
5.2.4 Height
Question: What is the height of the next adult male we will meet?
The three examples above are all discrete probability distributions, meaning that the outcome variable can only take on a limited set of values. A coin flip has two outcomes. The sum of a pair of dice has 11 outcomes. The total electoral votes for the Democratic candidate has 539 possible outcomes. In the limit, we can also create continuous probability distributions which have an infinite number of possible outcomes. For example, the average height for an American male could be any real number between 0 inches and 100 inches. (Of course, an average value anywhere near 0 or 100 is absurd. The point is that the average could be 68.564, 68.5643, 68.56432, 68.564327, or any real number.)
All the characteristics for discrete probability distributions which we reviewed above apply just as much to continuous probability distributions. For example, we can create mathematical, empirical and posterior probability distributions for continuous outcomes just as we did for discrete outcomes.
The empirical distribution involves using data from the National Health and Nutrition Examination Survey (NHANES). Instead of building a model ourselves from a mathematical formula, we use actual data. That data can either be simulated by us, as in the “flipping a coin” and “rolling two dice” scenarios, or collected by someone else, as in the presidential election scenario and this one.

The mathematical distribution is based entirely on a mathematical formula and assumptions. In the coin-flipping section, we assumed that the coin was perfectly fair, so that the probabilities of landing on heads and on tails were equal. In this case, we assume that the average height of men is 175 cm and that the standard deviation of height is around 9 cm. With these two values, the average (also called the mean) and the standard deviation (sd), we can create draws from a normal distribution using the rnorm() function. A normal distribution is a good approximation for height in our scenario.
Mathematical Distribution:

Again, the normal distribution, a probability distribution which is symmetric about its mean, is described by this formula.
\[y_i \sim N(\mu, \sigma^2)\].
Each value \(y_i\) is drawn from a normal distribution with parameters \(\mu\) for the mean and \(\sigma\) for the standard deviation. If the mathematical assumptions are correct (in this case, the normal shape and the values of the two parameters \(\mu\) and \(\sigma\)), then as our sample size increases, the empirical probability distribution will look more and more like the mathematical distribution.
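Here is a sketch of drawing from this mathematical distribution with rnorm(), using the mean of 175 cm and standard deviation of 9 cm assumed in the text. The seed and the number of draws are arbitrary choices of ours.

```r
set.seed(10)

# Parameters assumed in the text, in centimeters.
mu <- 175
sigma <- 9

# Draws from the normal distribution with that mean and standard deviation.
heights <- rnorm(n = 100000, mean = mu, sd = sigma)

# With this many draws, the summaries of the vector should be close to the
# parameters which generated it.
mean(heights)
sd(heights)
quantile(heights, probs = c(0.025, 0.975))
```

Note that rnorm() takes the standard deviation, not the variance, even though the formula above is written in terms of \(\sigma^2\).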
The posterior distribution for heights depends on the context. Are we considering all the adult men in America? In that case, our posterior would probably look a lot like the empirical distribution using NHANES data. If we are being asked about the distribution of heights among players in the NBA, then our posterior might look like:

In general, when we think about the differences among these three distributions, we can think of the mathematical distribution as completely theoretical. Imagine that we only have paper and a pencil and are asked to create a model to represent our population. That model is the mathematical distribution. The empirical distribution, on the other hand, is based completely on data. We do not write down a formula or equation and derive a distribution from it. Instead, we come up with an empirical distribution by analyzing data. Think of someone flipping a coin 1,000 times while we record the results, create a graph from them, and believe that the data we collected is representative of the entire population. Last but not least, the posterior distribution represents one’s beliefs about a situation. When we think about flipping a coin, our mathematical and empirical results would likely look similar and, absent any other information, our posterior should look similar as well. But suppose you are in a casino and someone proposes that they bet 100 bucks on heads and you bet 100 bucks on tails, with the winner taking all the money. Now you may be suspicious. You are unlikely to trust the person urging you to bet. The posterior distribution helps you to visualize and quantify your beliefs, which is extremely helpful when making decisions.
In short:
- A mathematical distribution is based on a mathematical formula and some basic assumptions.
- An empirical distribution is based on data, which could be collected by you or by somebody else.
- A posterior distribution is based on beliefs, which are usually shaped by added information about the scenario.
There is a direct connection between the concept of the population and the concept of a probability distribution. But they are not the same thing! The connection is that the probability distribution — whether it is empirical, mathematical, or posterior — is a description of the contents of the urn, which is the population.
To repeat: the population is the imaginary urn from which our data has been, or will be, drawn.
Comments:

FIGURE 5.3: The truth is out there
The truth is out there. If we asked all 300+ million Americans whether or not they approve of President Biden, we could know \(p\) exactly. Alas, we can’t do that. We use a posterior probability distribution to summarize our beliefs about the true value of \(p\), a truth we can never confirm.
Continuous variables are a myth. Nothing that can be represented on a computer is truly continuous. Even something which appears continuous, like \(p\), actually can only take on a (very large) set of discrete values. In this case, there are approximately 300 million possible true values of \(p\), one for each total number of people who approve of President Biden.
The math of continuous probability distributions can be tricky. Read a book on mathematical probability for all the messy details. Little of that matters in applied work.
The most important difference is that, with discrete distributions, it makes sense to estimate the probability of a specific outcome. What is the probability of rolling a 9? With continuous distributions, this makes no sense because there are an infinite number of possible outcomes. With continuous variables, we only estimate intervals.
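A short sketch of that difference follows. The discrete part enumerates the 36 outcomes of two fair dice. The continuous part uses a made-up model for adult heights (Normal with mean 175 cm and standard deviation 7 cm), which is an assumption for illustration only.

```r
# Discrete: the probability of one specific outcome is meaningful.
# Enumerate all 36 equally likely (die 1, die 2) pairs for two fair dice.
rolls <- expand.grid(die_1 = 1:6, die_2 = 1:6)
p_nine <- mean(rolls$die_1 + rolls$die_2 == 9)
p_nine  # 4 of the 36 pairs sum to 9

# Continuous: we ask about an interval instead of a single value.
# Hypothetical model: heights in cm are Normal(mean = 175, sd = 7).
p_interval <- pnorm(180, mean = 175, sd = 7) - pnorm(170, mean = 175, sd = 7)
p_interval  # probability that a height falls between 170 and 180 cm
```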
Don’t worry about the distinctions between discrete and continuous outcomes, or between the discrete and continuous probability distributions which we will use to summarize our beliefs about those outcomes. The basic intuition is the same in both cases.
5.2.5 Joint distributions
Recall that \(P(\text{coin})\) is the probability distribution for the result of a coin toss. It includes two parts, the probability of heads (\(\rho_h\)) and the probability of tails (\(\rho_t\)). This is a univariate distribution because there is only one outcome, which can be heads or tails. If there is more than one outcome, then we have a joint distribution.
Joint distributions are also mathematical objects that cover a set of outcomes, where each distinct outcome has a chance of occurring between 0 and 1 and the sum of all chances must equal 1. The key to a joint distribution is it measures the chance that both events A and B will occur. The notation is \(P(A, B)\).
Let’s say that you are rolling two six-sided dice simultaneously. Die 1 is weighted so that there is a 50% chance of rolling a 6 and a 10% chance of each of the other values. Die 2 is weighted so there is a 50% chance of rolling a 5 and a 10% chance of rolling each of the other values. Let’s roll both dice 1,000 times. In previous examples involving two dice, we cared about the sum of results and not the outcomes of the first versus the second die of each simulation. With a joint distribution, the order matters; so instead of 11 possible outcomes on the x-axis of our distribution plot (ranging from 2 to 12), we have 36. Furthermore, a 2D probability distribution is not sufficient to represent all of the variables involved, so the joint distribution for this example is displayed using a 3D plot.
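The joint distribution for these two dice can also be written down exactly rather than simulated. The sketch below assumes that the two dice are independent, so that the probability of each pair is the product of the two individual probabilities.

```r
# Probabilities for faces 1 through 6 of each weighted die.
die_1 <- c(0.1, 0.1, 0.1, 0.1, 0.1, 0.5)  # 50% chance of a 6
die_2 <- c(0.1, 0.1, 0.1, 0.1, 0.5, 0.1)  # 50% chance of a 5

# Assuming independence, P(die_1 = i, die_2 = j) = P(die_1 = i) * P(die_2 = j).
joint <- outer(die_1, die_2)
dimnames(joint) <- list(die_1 = 1:6, die_2 = 1:6)

joint             # a 6-by-6 table with 36 distinct outcomes
joint["6", "5"]   # the single most likely pair
sum(joint)        # the probabilities of all 36 outcomes sum to 1
```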
5.2.6 Conditional distributions
Imagine that 60% of people in a community have a disease. A doctor develops a test to determine if a random person has the disease. However, this test isn’t 100% accurate. There is an 80% probability of correctly returning positive if the person has the disease and 90% probability of correctly returning negative if the person does not have the disease.
The probability of a random person having the disease is 0.6. Since each person either has the disease or doesn’t (those are the only two possibilities), the probability that a person does not have the disease is \(1 - 0.6 = 0.4\).

If the random person has the disease, then we go up the top branch. The probability of an infected person testing positive is 0.8 because the test is 80% sure of correctly returning positive when the person has the disease.
By the same logic, if the random person does not have the disease, we go down the bottom branch. The probability of the person incorrectly testing positive is 0.1.
We decide to go down the top branch if our random person has the disease. We go down the bottom branch if they do not. This is called conditional probability. The probability of testing positive is dependent on whether the person has the disease.
How would you express this in statistical notation? \(P(A|B)\) is the same thing as the probability of A given B. \(P(A|B)\) essentially means the probability of A if we know for sure the value of B. Note that \(P(A|B)\) is not the same thing as \(P(B|A)\).
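To see that \(P(A|B)\) and \(P(B|A)\) really are different, here is a sketch which uses the numbers from the disease example above. The variable names are our own; the calculation is Bayes’ rule.

```r
p_disease <- 0.6             # P(disease)
p_pos_given_disease <- 0.8   # P(positive | disease)
p_pos_given_healthy <- 0.1   # P(positive | no disease) = 1 - 0.9

# Total probability of a positive test, summing over both branches.
p_pos <- p_pos_given_disease * p_disease +
  p_pos_given_healthy * (1 - p_disease)

# Bayes' rule: P(disease | positive).
p_disease_given_pos <- p_pos_given_disease * p_disease / p_pos

p_pos_given_disease   # P(positive | disease) is 0.8
p_disease_given_pos   # P(disease | positive) is 0.48 / 0.52, about 0.92
```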
In summary, we work with three main categories of probability distributions. First, p(A) is the probability distribution for event A. This is sometimes referred to as a univariate probability distribution because there is only one random variable. Second, p(A, B) is the joint probability distribution of A and B. Third, p(A | B) is the conditional probability distribution of A given that B has taken on a specific value. This is often written as p(A | B = b).
5.3 List-columns and map functions
Before working with probability models, we need to expand our collection of R tricks by understanding list-columns and map_* functions. Recall that a list is different from an atomic vector. In atomic vectors, each element of the vector has one value. Lists, however, can contain vectors, and even more complex objects, as elements.
## [[1]]
## [1] 4 16 9
##
## [[2]]
## [1] "A" "Z"
x is a list with two elements. The first element is a numeric vector of length 3. The second element is a character vector of length 2. We use [[]] to extract specific elements.
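The code which created x is not shown above. A minimal sketch, assuming that x was built directly with list(), might be:

```r
# A list whose elements are vectors of different types and lengths.
x <- list(c(4, 16, 9), c("A", "Z"))

length(x)       # two elements
class(x[[1]])   # double brackets return the element itself: "numeric"
class(x[1])     # single brackets return a list of length one: "list"
```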
x[[1]]
## [1] 4 16 9
There are a number of built-in R functions that output lists. For example, the ggplot objects you have been making store all of the plot information in lists. Any function that returns multiple values can be used to create a list output by wrapping that returned object with list().
## # A tibble: 1 x 1
## col_1
## <list>
## 1 <dbl [2]>
Notice this is a 1-by-1 tibble with one observation, which is a list of one element. Voila! You have just created a list-column.
If a function returns multiple values as a vector, like range() does, you must use list() as a wrapper if you want to create a list-column.
A list column is a column of your data which is a list rather than an atomic vector. As with stand-alone list objects, you can pipe to str() to examine the column.
# tibble() creates a tibble from scratch. It works somewhat like mutate(), but
# mutate() needs an existing data frame to which it adds columns, while tibble()
# can stand on its own.
tibble(col_1 = list(range(x))) %>%
  str()
## tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
## $ col_1:List of 1
## ..$ : num [1:2] -1.01 1.42
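The sketch below shows why the list() wrapper matters. The definition of x used in the call above is not shown, so we assume here that x is a numeric vector and create one with rnorm() purely for illustration.

```r
library(tibble)

set.seed(9)
x <- rnorm(10)  # assumption: a stand-in for the numeric vector used above

# With list(): one row whose single cell holds both values.
with_list <- tibble(col_1 = list(range(x)))

# Without list(): range() returns two numbers, so we get two rows.
without_list <- tibble(col_1 = range(x))

dim(with_list)      # 1 row, 1 column
dim(without_list)   # 2 rows, 1 column
```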
We can use map_* functions to both create a list-column and then, much more importantly, work with that list-column afterwards.
# .x is col_1 from the tibble and .f is the formula ~ sum(.)
tibble(col_1 = list(range(x))) %>%
  mutate(col_2 = map_dbl(col_1, ~ sum(.))) %>%
  str()
## tibble [1 × 2] (S3: tbl_df/tbl/data.frame)
## $ col_1:List of 1
## ..$ : num [1:2] -1.01 1.42
## $ col_2: num 0.415
map_* functions, like map_dbl() in this example, take two key arguments, .x (the data which will be acted on) and .f (the function which will act on this data). Here, .x is the data in col_1, which is a list-column, and .f is the function sum(). Because sum() needs no extra arguments, we could have written map_dbl(col_1, sum). In this chapter, however, we write .f as a formula: a tilde (~) indicates the start of the function and a dot (.) specifies where the data goes in the function.
map_* functions are a family of functions, with the suffix specifying the type of the object to be returned. map() itself returns a list. map_dbl() returns a double. map_int() returns an integer. map_chr() returns a character, and so on.
To summarize: map() and the map_* functions apply a function or formula to each element of a vector or list. Both take two main arguments, as in map(.x, .f). .x can be either a list or an atomic vector. .f can be either a function name, like .f = mean, or a formula, like .f = ~ mean(.x). Remember to include the ~ if .f is a formula. The difference lies in what comes back. map() always returns a list, which is what we store in a list-column. When you know that each result is a single value of a specific type (double, logical, character, integer), use the matching map_* function to get an atomic vector, and therefore an ordinary column, instead.
tibble(ID = 1) %>%
  mutate(col_1 = map(ID, ~range(rnorm(10)))) %>%
  mutate(col_2 = map_dbl(col_1, ~ sum(.))) %>%
  mutate(col_3 = map_int(col_1, ~ length(.))) %>%
  mutate(col_4 = map_chr(col_1, ~ sum(.))) %>%
  str()
## tibble [1 × 5] (S3: tbl_df/tbl/data.frame)
## $ ID : num 1
## $ col_1:List of 1
## ..$ : num [1:2] -1.45 1.31
## $ col_2: num -0.143
## $ col_3: int 2
## $ col_4: chr "-0.142989"
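As a small illustration of the two ways of supplying .f and of the different return types, consider the sketch below. The list `draws` is a made-up example.

```r
library(purrr)

draws <- list(c(1, 2, 3), c(10, 20))

# .f as a bare function name and .f as a formula give the same result.
by_name    <- map_dbl(draws, sum)
by_formula <- map_dbl(draws, ~ sum(.))
identical(by_name, by_formula)

# map() always returns a list; the map_* variants return atomic vectors.
as_list <- map(draws, range)      # list of two numeric vectors
lengths <- map_int(draws, length) # integer vector: 3 2
```

The formula form becomes necessary once the call needs additional arguments or uses the data somewhere other than the first position.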
Consider a more detailed example:
# This simple example demonstrates the workflow which we will often follow.
# Start by creating a tibble which will be used to store the results. (Or start
# with a tibble which already exists and to which you will be adding more
# columns.) It is often convenient to get all the code working with just a few
# rows. Once it is working, we increase the number of rows to a thousand or
# million or whatever we need.
tibble(ID = 1:3) %>%
  # The big convenience is being able to store a list in each row of the tibble.
  # Note that we are not using the value of ID in the call to rnorm(). (That is
  # why we don't have a "." anywhere.) But we are still using ID as a way of
  # iterating through each row; ID is keeping count for us, in a sense.
  mutate(draws = map(ID, ~ rnorm(10))) %>%
  # Each succeeding step of the pipe works with columns already in the tibble
  # while, in general, adding more columns. The next step calculates the max
  # value in each of the draw vectors. We use map_dbl() because we know that
  # max() will return a single number.
  mutate(max = map_dbl(draws, ~ max(.))) %>%
  # We will often need to calculate more than one item from a given column like
  # draws. For example, in addition to knowing the max value, we would like to
  # know the range. Because the range is a vector, we need to store the result
  # in a list column. map() does that for us automatically.
  mutate(min_max = map(draws, ~ range(.)))
## # A tibble: 3 x 4
## ID draws max min_max
## <int> <list> <dbl> <list>
## 1 1 <dbl [10]> 0.758 <dbl [2]>
## 2 2 <dbl [10]> 1.06 <dbl [2]>
## 3 3 <dbl [10]> 1.69 <dbl [2]>
This flexibility is only possible via the use of list-columns and map_* functions. This workflow is extremely common. We start with an empty tibble, using ID to specify the number of rows. With that skeleton, each step of the pipe adds a new column, working off a column which already exists.
5.4 Two models
The simplest possible setting for inference involves two models, meaning two possible states of the world, and two outcomes from an experiment. Imagine that there is a disease, Probophobia (an irrational fear of probability), which you either have or don’t have. We don’t know if you have the disease, but we do assume that there are only two possibilities.
We also have a test which is 99% accurate when given to a person who has Probophobia. Unfortunately, the test is only 50% accurate for people who do not have Probophobia. In this experiment, there are only two possible outcomes: a positive and a negative result on the test.
Question: If you test positive, what is the probability that you have Probophobia?
More generally, we are estimating a conditional probability. Conditional on the outcome of a positive test, what is the probability that you have Probophobia? Mathematically, we want:
\[ P(\text{Probophobia | Test = Positive} ) \]
To answer this question, we need to use the tools of joint and conditional probability from earlier in the Chapter. We begin by building, by hand, the joint distribution of the possible models (you have the Probophobia or you do not) and of the possible outcomes (you test positive or negative). Building the joint distribution involves assuming that each model is true and then creating the distribution of outcomes which might occur if that assumption is true.
For example, assume you have Probophobia. There is then a 99% chance that you test positive and a 1% chance that you test negative. Similarly, if we assume that the second model is true (that you don’t have Probophobia), then there is a 50% chance that you test positive and a 50% chance that you test negative. Of course, for you (or any individual) we do not know for sure what is happening. We do not know if you have the disease. We do not know what your test will show. But we can use these relationships to construct the joint distribution.
# Pipes generally start with tibbles, so we start with a tibble which just
# includes an ID variable. We don't really use ID. It is just handy for getting
# organized. We call this object `jd_disease`, where the `jd` stands for
# joint distribution.
sims <- 10000
# Half of the simulated people have the disease and half do not.
jd_disease <- tibble(ID = 1:sims,
                     have_disease = rep(c(TRUE, FALSE), sims / 2)) %>%
  mutate(positive_test =
           if_else(have_disease,
                   map_int(have_disease, ~ rbinom(n = 1, size = 1, prob = 0.99)),
                   map_int(have_disease, ~ rbinom(n = 1, size = 1, prob = 0.5))))
jd_disease
## # A tibble: 10,000 x 3
## ID have_disease positive_test
## <int> <lgl> <int>
## 1 1 TRUE 1
## 2 2 FALSE 1
## 3 3 TRUE 1
## 4 4 FALSE 1
## 5 5 TRUE 1
## 6 6 FALSE 0
## 7 7 TRUE 1
## 8 8 FALSE 1
## 9 9 TRUE 1
## 10 10 FALSE 0
## # … with 9,990 more rows
The first step is to create a tibble of simulated data from which we can plot the joint distribution. The two states of the world, having the disease and not having it, come with two different probabilities of testing positive, and we want the disease status and the test result in exactly two columns so that we can graph them using ggplot(). That is why we use rep() when creating the tibble: it repeats the pair TRUE, FALSE 5,000 times, giving 10,000 rows in which half the people have the disease and half do not. For each row we then draw a test result with rbinom(), using a 99% probability of a positive test for people who have the disease and a 50% probability for people who do not. Note that 10,000 is also the size of the population in our simulated data: 5,000 results from the have-disease group and 5,000 from the no-disease group. We often use a capital N for the population size, so here N = 10,000.
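Because this example is so simple, the joint distribution can also be computed without simulation. The sketch below uses the accuracies stated for the test (99% for people with Probophobia, 50% for people without) and assumes, as the simulation does, that half of the population has the disease.

```r
sims <- 10000

# Share of the simulated population in each state of the world.
prior <- c(disease = 0.5, healthy = 0.5)

# Probability of a positive test under each model.
p_positive <- c(disease = 0.99, healthy = 0.5)

# Each cell is P(model) * P(result | model).
joint <- rbind(positive = prior * p_positive,
               negative = prior * (1 - p_positive))

joint          # the four cells sum to 1
joint * sims   # expected counts in a simulation with 10,000 rows
```

The simulated counts will differ a little from these expected counts because of random variation.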
Plot the joint distribution:

Below is the joint distribution displayed in 3D. Instead of using the “jitter” feature in R to unstack the dots, we use a 3D plot to visualize the number of dots in each box. The 3D plot shows the total number of cases in each section (true positive, true negative, false positive, false negative), with one bar for each combination. People who have the disease and correctly test positive form by far the largest category, while almost no one who has the disease tests negative. Now pay attention to the two rows of the 3D graph. If you add up the heights of the bars for the two have-disease sections and for the two no-disease sections, the totals should be equal, with 5,000 cases each. This is because we simulate the experiment in two separate worlds, the have-disease world and the no-disease world.
This Section is called “Two Models” because, for each person, there are two possible states of the world: have the disease or not have the disease. By assumption, there are no other outcomes. We call these two possible states of the world “models,” even though they are very simple models.
In addition to the two models, we have two possible results of our experiment on a given person: test positive or test negative. Again, this is an assumption. We do not allow for any other outcome. In coming sections, we will look at more complex situations where we consider more than two models and more than two possible results of the experiment. In the meantime, we have built the unnormalized joint distribution for models and results. This is a key point! Look back earlier in this Chapter for discussions about both unnormalized distributions and joint distributions.
We want to analyze these plots by looking at different slices. For instance, let’s say that you have tested positive for the disease. Since the test is not always accurate, you cannot be 100% certain that you have it. We isolate the slice where the test result equals 1 (meaning positive).
jd_disease %>%
  filter(positive_test == 1)
## # A tibble: 7,484 x 3
## ID have_disease positive_test
## <int> <lgl> <int>
## 1 1 TRUE 1
## 2 2 FALSE 1
## 3 3 TRUE 1
## 4 4 FALSE 1
## 5 5 TRUE 1
## 6 7 TRUE 1
## 7 8 FALSE 1
## 8 9 TRUE 1
## 9 11 TRUE 1
## 10 12 FALSE 1
## # … with 7,474 more rows
Most people who test positive have the disease. This is the kind of result we see for common diseases, like a cold. We can easily create an unnormalized conditional distribution with:

filter() transforms a joint distribution into a conditional distribution.
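As a sketch of that step, followed by normalization, the code below rebuilds a small version of the simulation and then slices it. The object names `jd` and `posterior` are our own, and the vectorized call to rbinom() is a shortcut, not the code used elsewhere in this section.

```r
library(dplyr)

set.seed(10)
sims <- 10000

# Joint distribution: disease status and test result for each simulated person.
jd <- tibble(have_disease = rep(c(TRUE, FALSE), sims / 2)) %>%
  mutate(positive_test = rbinom(n = sims, size = 1,
                                prob = if_else(have_disease, 0.99, 0.5)))

posterior <- jd %>%
  filter(positive_test == 1) %>%    # the slice: conditional on a positive test
  count(have_disease) %>%           # unnormalized: raw counts
  mutate(prob = n / sum(n))         # normalized: proportions that sum to 1

posterior
```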
Turn this unnormalized distribution into a posterior probability distribution:

If we zoom in on the plot, roughly two-thirds of the people who tested positive have the disease and roughly one-third do not. In this case, we are focusing on one slice of the probability distribution, the slice where the test result was positive. Within that slice there are two possibilities: the person has the disease or does not. By isolating a section, we are looking at a conditional distribution. Conditional on a positive test, you can visualize the likelihood of actually having the disease versus not.
Now recall the question we asked at the start of the section: If you test positive, what is the probability that you have Probophobia?
By looking at the posterior graph we just created, we can answer this question easily. With a positive test, you can be roughly two-thirds sure that you have Probophobia. However, there is a good chance, about one in three, that you received a false positive. So don’t worry too much: there is still a real possibility that your result is wrong.
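The simulated answer can be checked against an exact calculation. This sketch applies Bayes’ rule to the stated test accuracies and assumes, as in the simulation, that each model is equally likely beforehand.

```r
prior <- c(disease = 0.5, healthy = 0.5)
p_positive <- c(disease = 0.99, healthy = 0.5)

# Unnormalized: P(model) * P(positive | model), the "positive" slice.
unnormalized <- prior * p_positive

# Normalized: divide by the total so that the values sum to 1.
posterior <- unnormalized / sum(unnormalized)
round(posterior, 3)   # disease: 0.664, healthy: 0.336
```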
Now let’s consider how to use this posterior. Here is another question. Question: 10 people walk up to a testing center. 5 of them test negative and 5 of them test positive. What is the probability that at least 6 of them are actually healthy?
# Each row is one simulated group of 10 people. A 1 means that the person is
# actually healthy. Persons 1 to 5 tested positive. Persons 6 to 10 tested
# negative. Bayes' rule gives P(healthy | positive) = 0.25 / 0.745, about 0.34,
# and P(healthy | negative) = 0.25 / 0.255, about 0.98.
tibble(test = 1:100000) %>%
  mutate(person1 = map_int(test, ~ rbinom(n = 1, size = 1, prob = 0.34))) %>%
  mutate(person2 = map_int(test, ~ rbinom(n = 1, size = 1, prob = 0.34))) %>%
  mutate(person3 = map_int(test, ~ rbinom(n = 1, size = 1, prob = 0.34))) %>%
  mutate(person4 = map_int(test, ~ rbinom(n = 1, size = 1, prob = 0.34))) %>%
  mutate(person5 = map_int(test, ~ rbinom(n = 1, size = 1, prob = 0.34))) %>%
  mutate(person6 = map_int(test, ~ rbinom(n = 1, size = 1, prob = 0.98))) %>%
  mutate(person7 = map_int(test, ~ rbinom(n = 1, size = 1, prob = 0.98))) %>%
  mutate(person8 = map_int(test, ~ rbinom(n = 1, size = 1, prob = 0.98))) %>%
  mutate(person9 = map_int(test, ~ rbinom(n = 1, size = 1, prob = 0.98))) %>%
  mutate(person10 = map_int(test, ~ rbinom(n = 1, size = 1, prob = 0.98))) %>%
  select(!test) %>%
  mutate(sum = rowSums(.)) %>%
  ggplot(aes(sum)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   binwidth = 1,
                   color = "white") +
    scale_x_continuous(breaks = c(0:10)) +
    scale_y_continuous(labels =
                         scales::percent_format(accuracy = 1)) +
    theme_classic()
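The plot shows the whole distribution, but the question asks for a single number. The sketch below is one way to get it. It computes each person’s chance of being healthy from Bayes’ rule, using the stated test accuracies and an assumed 50% prevalence, rather than rounding those chances, and it treats the 10 people as independent. It then compares a simulation with an exact calculation.

```r
prior_disease <- 0.5   # assumption: same prevalence as the simulation
p_pos_disease <- 0.99  # P(positive | disease)
p_pos_healthy <- 0.5   # P(positive | healthy)

# P(healthy | positive test) and P(healthy | negative test).
p_healthy_given_pos <- (1 - prior_disease) * p_pos_healthy /
  ((1 - prior_disease) * p_pos_healthy + prior_disease * p_pos_disease)
p_healthy_given_neg <- (1 - prior_disease) * (1 - p_pos_healthy) /
  ((1 - prior_disease) * (1 - p_pos_healthy) +
     prior_disease * (1 - p_pos_disease))

# Simulation: healthy people among 5 positive testers plus 5 negative testers.
set.seed(11)
sims <- 100000
healthy <- rbinom(sims, size = 5, prob = p_healthy_given_pos) +
  rbinom(sims, size = 5, prob = p_healthy_given_neg)
simulated <- mean(healthy >= 6)

# Exact: combine the two binomial distributions and add up the relevant cells.
joint <- outer(dbinom(0:5, 5, p_healthy_given_pos),
               dbinom(0:5, 5, p_healthy_given_neg))
totals <- outer(0:5, 0:5, "+")
exact <- sum(joint[totals >= 6])

c(simulated = simulated, exact = exact)
```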
This Stat 110 Animations video does a really good job of explaining similar concepts.
5.5 Three models

Imagine that your friend gives you a bag with two marbles. There could either be two white marbles, two black marbles, or one of each color. Thus, the bag could contain 0% white marbles, 50% white marbles, or 100% white marbles. Respectively, the proportion, \(p\), of white marbles could be 0, 0.5, or 1.
Question: What is the chance that the bag contains exactly two white marbles, given that we selected a marble three times and it was white every time?
\[ P(\text{2 White Marbles in bag | White Marbles Sampled = 3} ) \]
Just as with the Probophobia models, to answer this question we start with simulated data and then graph the joint distribution of this scenario, because we need to consider all possible outcomes of the model. Based on the joint distribution, we slice out the part we want (the conditional distribution) and, in the end, normalize it to produce a posterior graph that shows the probabilities.
Step 1: Simulate the data into a tibble
Let’s say you take a marble out of the bag, record whether it’s black or white, then return it to the bag. You repeat this three times, observing the number of white marbles you see out of three trials. You could get three whites, two whites, one white, or zero whites as a result of this trial. We have three models (three different proportions of white marbles in the bag) and four possible experimental results. Let’s create 10,000 draws from this joint distribution:
# Create the joint distribution of the number of white marbles in the bag
# (in_bag) and the number of white marbles pulled out in the sample (in_sample),
# one-by-one. in_bag takes three possible values: 0, 1 and 2, corresponding to
# zero, one and two white marbles potentially in the bag.
set.seed(3)
sims <- 10000
# We also start off with a tibble. It just makes things easier
jd_marbles <- tibble(ID = 1:sims) %>%
  # For each row, we (randomly!) determine the number of white marbles in the
  # bag. sample() returns a double here because c(0, 1, 2) is a double vector,
  # so we convert the result with as.integer() before map_int() stores it.
  mutate(in_bag = map_int(ID, ~ as.integer(sample(c(0, 1, 2),
                                                  size = 1)))) %>%
  # Depending on the number of white marbles in the bag, we randomly draw out 0,
  # 1, 2, or 3 white marbles in our experiment. We need `prob = ./2` to transform
  # the number of white marbles into the probability of drawing out a white
  # marble in a single draw. That probability is either 0%, 50% or 100%.
  mutate(in_sample = map_int(in_bag, ~ rbinom(n = 1,
                                              size = 3,
                                              prob = ./2)))
jd_marbles
## # A tibble: 10,000 x 3
## ID in_bag in_sample
## <int> <int> <int>
## 1 1 0 0
## 2 2 1 3
## 3 3 2 3
## 4 4 1 1
## 5 5 2 3
## 6 6 2 3
## 7 7 1 0
## 8 8 2 3
## 9 9 0 0
## 10 10 1 2
## # … with 9,990 more rows
Step 2: Plot the joint distribution:
# The distribution is unnormalized. All we see is the number of outcomes in each
# "bucket." Although it is never stated clearly, we are assuming that there is
# an equal likelihood of 0, 1 or 2 white marbles in the bag.
jd_marbles %>%
  ggplot(aes(x = in_sample, y = in_bag)) +
    geom_jitter(alpha = 0.5) +
    labs(title = "Black and White Marbles",
         subtitle = "More white marbles in bag mean more white marbles selected",
         x = "White Marbles Selected",
         y = "White Marbles in the Bag") +
    scale_y_continuous(breaks = c(0, 1, 2)) +
    theme_classic()
Here is the 3D visualization:
The y-axes of both the scatterplot and the 3D visualization show the number of white marbles in the bag. Each value on the y-axis is a model, a belief about the world. For instance, when the model is 0, we have no white marbles in the bag, meaning that none of the marbles we pull out in the sample will be white.
Now recall the question. We essentially only care about the fourth column in the joint distribution (x-axis = 3) because the question asks us to create a conditional distribution given the fact that 3 white marbles were selected. Therefore, we isolate the slice where the result of the simulation involves three white marbles and zero black ones. Here is the unnormalized probability distribution.
Step 3: Plot the unnormalized conditional distribution.
# The key step is the filter. Creating a conditional distribution from a joint
# distribution is the same thing as filtering that joint distribution for a
# specific value. A conditional distribution is a "slice" of the joint
# distribution, and we take that slice with filter().
jd_marbles %>%
  filter(in_sample == 3) %>%
  ggplot(aes(in_bag)) +
    geom_histogram(binwidth = 0.5, color = "white") +
    labs(title = "Unnormalized Conditional Distribution",
         subtitle = "Number of white marbles in bag given that three were selected in the sample",
         x = "Number of White Marbles in the Bag",
         y = "Count") +
    coord_cartesian(xlim = c(0, 2)) +
    scale_x_continuous(breaks = c(0, 1, 2)) +
    theme_classic()
Step 4: Plot the normalized posterior distribution. Next, let’s normalize the distribution.
jd_marbles %>%
  filter(in_sample == 3) %>%
  ggplot(aes(in_bag)) +
    geom_histogram(aes(y = after_stat(count/sum(count))),
                   binwidth = 0.5,
                   color = "white") +
    labs(title = "Posterior Probability Distribution",
         subtitle = "Number of white marbles in bag given that three were selected in the sample",
         x = "Number of White Marbles in the Bag",
         y = "Probability") +
    coord_cartesian(xlim = c(0, 2)) +
    scale_x_continuous(breaks = c(0, 1, 2)) +
    scale_y_continuous(labels =
                         scales::percent_format(accuracy = 1)) +
    theme_classic()
This plot makes sense because when all three marbles you draw out of the bag are white, there is a pretty good chance that there are no black marbles in the bag. But you can’t be certain! It is possible to draw three white even if the bag contains one white and one black. However, it is impossible that there are zero white marbles in the bag.
Lastly, let’s answer the question: What is the chance that the bag contains exactly two white marbles, given that we selected a marble three times and it was white every time?
Answer: As the posterior probability distribution shows (x-axis = 2), the chance that the bag contains exactly two white marbles, given that we selected 3 white marbles in three tries, is about 89%.
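With only three models, the posterior can also be calculated exactly, which is a useful check on the simulation. The sketch below assumes, as the simulation does, that 0, 1 and 2 white marbles in the bag are equally likely beforehand.

```r
in_bag <- 0:2
prior <- rep(1 / 3, 3)

# Probability of drawing 3 whites in 3 draws (with replacement) under each model.
likelihood <- dbinom(3, size = 3, prob = in_bag / 2)

# Multiply by the prior, then normalize so that the values sum to 1.
posterior <- prior * likelihood / sum(prior * likelihood)
names(posterior) <- in_bag
posterior   # 0, 1/9 and 8/9
```

The exact value for two white marbles is 8/9. A simulation with 10,000 draws will land close to, but not exactly on, that value.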
5.6 N models

Assume that there is a coin with \(\rho_h\). We guarantee that there are only 11 possible values of \(\rho_h\): \(0, 0.1, 0.2, ..., 0.9, 1\). In other words, there are 11 possible models, 11 things which might be true about the world. This is just like situations we have previously discussed, except that there are more models to consider.
We are going to run an experiment in which you flip the coin 20 times and record the number of heads. What does this result tell you about the value of \(\rho_h\)? Ultimately, we will want to calculate a posterior distribution of \(\rho_h\), which is written as p(\(\rho_h\)).
Question: What is the probability of getting exactly 8 heads out of 20 tosses?
To start, it is useful to consider all the things which might happen if, for example, \(\rho_h = 0.4\). Fortunately, the R functions for simulating random variables make this easy.
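A minimal sketch of such a simulation follows. The number of repetitions (1,000) and the object names are our own choices for illustration.

```r
set.seed(12)

# Run the 20-toss experiment 1,000 times, assuming that rho_h = 0.4.
heads <- rbinom(n = 1000, size = 20, prob = 0.4)

# Raw counts for every possible result, 0 through 20 heads.
counts <- table(factor(heads, levels = 0:20))

# Dividing by the total turns the counts into a probability distribution.
probs <- counts / sum(counts)

names(which.max(probs))   # the most common number of heads
```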

First, notice that many different things can happen! Even if we know, for certain, that \(\rho_h = 0.4\), many outcomes are possible. Life is remarkably random. Second, the most likely result of the experiment is 8 heads, as we would expect. Third, we have transformed the raw counts of how many times each total appeared into a probability distribution. Sometimes, however, it is convenient to just keep track of the raw counts. The shape of the figure is the same in both cases.

Either way, the figures show what would have happened if that model — that \(\rho_h = 0.4\) — were true.
We can do the same thing for all 11 possible models, calculating what would happen if each of them were true. This is somewhat counterfactual since only one of them can be true. Yet this assumption does allow us to create the joint distribution of models which might be true and of data which our experiment might generate. Let’s simplify this as p(models, data), although you should keep the precise meaning in mind.

Here is the 3D version of the same plot.
In both of these diagrams, we see 11 models and 21 outcomes. We don’t really care about p(\(models\), \(data\)), the joint distribution of the models-which-might-be-true and the data-which-our-experiment-might-generate. Instead, we want to estimate \(\rho_h\), the unknown parameter which determines the probability that this coin will come up heads when tossed. The joint distribution alone can’t tell us that. We created the joint distribution before we had even conducted the experiment. It is our creation, a tool which we use to make inferences. What we want is the conditional distribution, p(\(models\) | \(data = 8\)). We have the results of the experiment. What do those results tell us about the probability distribution of \(\rho_h\)?
To answer this question, we simply take a vertical slice from the joint distribution at the point of the x-axis corresponding to the results of the experiment.
This animation shows what we want to do with joint distributions. We take a slice (the red one), isolate it, rotate it to look at the conditional distribution, normalize it (change the values along the current z-axis from counts to probabilities), then observe the resulting posterior.

This is the only part of the joint distribution that we care about. We aren’t interested in what the object looks like where, for example, the number of heads is 11. That portion is irrelevant because we observed 8 heads, not 11. By using the filter function on the simulation tibble we created, we find that there are a total of 465 simulations in which 8 heads were observed.
As we would expect, when 8 coin tosses came up heads, the most common value of \(\rho_h\) was 0.4. But, on numerous occasions, it was not. It is quite common for a value of \(\rho_h\) like 0.3 or 0.5 to generate 8 heads. Consider:

Yet this is a distribution of raw counts. It is an unnormalized density. To turn it into a proper probability density (i.e., one in which the probabilities across possible outcomes sum to one) we just divide everything by the total number of observations.

Solution:
The most likely value of \(\rho_h\) is 0.4, as before. But, taken together, the values 0.3 and 0.5 are more likely than 0.4. And there is about an 8% chance that \(\rho_h \ge 0.6\).
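These statements can be checked with an exact calculation. The sketch below assumes that all 11 models are equally likely before the experiment, which is what simulating each model the same number of times implies.

```r
models <- seq(0, 1, by = 0.1)

# Probability of exactly 8 heads in 20 tosses under each model.
likelihood <- dbinom(8, size = 20, prob = models)

# With equal prior weights, normalizing the likelihood gives the posterior.
posterior <- likelihood / sum(likelihood)
names(posterior) <- models
round(posterior, 3)

# Posterior probability that rho_h is 0.6 or larger.
p_high <- sum(posterior[models > 0.55])
p_high
```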
You might be wondering: what is the use of a model? Well, let’s say we toss the coin 20 times and get 8 heads again. Given this result, we can ask: What is the probability that future samples of 20 flips will result in 10 or more heads?
There are three main ways you could go about solving this problem with simulations.
The first way, which is wrong, is to assume that \(\rho_h\) is known with certainty because we observed 8 heads in 20 tosses. We would conclude that \(\rho_h\) is 8/20, or 0.4. The big problem with this approach is that it ignores our uncertainty when estimating \(\rho_h\). It leads to the following code.
sims <- 10000000
odds <- tibble(sim_ID = 1:sims) %>%
  mutate(heads = map_int(sim_ID, ~ rbinom(n = 1, size = 20, prob = 0.4))) %>%
  # above_ten is TRUE when a simulation produces 10 or more heads.
  mutate(above_ten = heads >= 10)
odds
## # A tibble: 10,000,000 x 3
## sim_ID heads above_ten
## <int> <int> <lgl>
## 1 1 9 FALSE
## 2 2 7 FALSE
## 3 3 5 FALSE
## 4 4 8 FALSE
## 5 5 9 FALSE
## 6 6 9 FALSE
## 7 7 8 FALSE
## 8 8 8 FALSE
## 9 9 4 FALSE
## 10 10 6 FALSE
## # … with 9,999,990 more rows
odds %>%
  ggplot(aes(x = heads, fill = above_ten)) +
    geom_histogram(aes(y = after_stat(count/sum(count))), bins = 50) +
    scale_fill_manual(values = c('grey50', 'red')) +
    labs(title = "Posterior Probability Distribution (Wrong Way)",
         subtitle = "Number of heads in 20 tosses",
         x = "Number of heads",
         y = "Probability",
         fill = "Above ten heads") +
    scale_x_continuous(labels = scales::number_format(accuracy = 1)) +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
    theme_classic()
Using this posterior distribution, derived from the (wrong way) simulated data, the probability that 20 tosses result in 10 or more heads is
odds %>%
  summarize(success = sum(above_ten) / sims)
## # A tibble: 1 x 1
## success
## <dbl>
## 1 0.245
about 24.5%.
The second method samples from the whole posterior distribution of \(\rho_h\) which we previously created, so our uncertainty about \(\rho_h\) carries through to the forecast. This leads to the following correct code.
# Approximate the posterior of rho_h: keep the values of p which produced 8 heads.
p_draws <- tibble(p = rep(seq(0, 1, 0.1), 1000)) %>%
  mutate(heads = map_int(p, ~ rbinom(n = 1, size = 20, prob = .x))) %>%
  filter(heads == 8)

# Draw p from that posterior, then simulate 20 new tosses for each draw.
odds_2nd <- tibble(p = sample(p_draws$p, size = sims, replace = TRUE)) %>%
  mutate(heads = map_int(p, ~ rbinom(n = 1, size = 20, prob = .x))) %>%
  mutate(above_ten = heads >= 10)
odds_2nd
## # A tibble: 10,000,000 x 3
## p heads above_ten
## <dbl> <int> <lgl>
## 1 0.5 9 FALSE
## 2 0.4 7 FALSE
## 3 0.4 9 FALSE
## 4 0.3 7 FALSE
## 5 0.5 12 TRUE
## 6 0.3 9 FALSE
## 7 0.4 11 TRUE
## 8 0.3 8 FALSE
## 9 0.4 10 TRUE
## 10 0.4 7 FALSE
## # … with 9,999,990 more rows
odds_2nd %>%
  ggplot(aes(x = heads, fill = above_ten)) +
  geom_histogram(aes(y = after_stat(count / sum(count))), bins = 50) +
  scale_fill_manual(values = c('grey50', 'red')) +
  labs(title = "Posterior Probability Distribution (Right Way)",
       subtitle = "Number of heads in 20 tosses",
       x = "Number of heads",
       y = "Probability",
       fill = "10 or more heads") +
  scale_x_continuous(labels = scales::number_format(accuracy = 1)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  theme_classic()
Using this posterior distribution, derived from the (right way) simulated data, the probability that 20 tosses result in 10 or more heads is
odds_2nd %>%
  summarize(success = sum(above_ten) / sims)
## # A tibble: 1 x 1
## success
## <dbl>
## 1 0.328
about 32.8%.
As you may have noticed, if you calculated the value using the first method, you would believe that getting 10 or more heads is less likely than it really is. If you were to run a casino based on these assumptions, you would lose all your money. It is very important to be careful about the assumptions you are making. We tossed a coin 20 times and got 8 heads. However, you would be wrong to assume that \(\rho_h = 0.4\) just based on this result.
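The gap between the two answers can also be computed without simulation. The sketch below assumes the same grid of candidate values for \(\rho_h\) (0, 0.1, ..., 1, equally likely beforehand) and uses the exact binomial functions in base R. The names `wrong_way` and `right_way` are ours, not part of the code above.

```r
# Posterior for rho_h on the grid, given 8 heads in 20 tosses.
p_grid <- seq(0, 1, 0.1)
posterior <- dbinom(8, size = 20, prob = p_grid)
posterior <- posterior / sum(posterior)

# P(10 or more heads in 20 new tosses) for each candidate value of rho_h.
p_ten_plus <- pbinom(9, size = 20, prob = p_grid, lower.tail = FALSE)

# Wrong way: pretend rho_h is exactly 0.4.
wrong_way <- pbinom(9, size = 20, prob = 0.4, lower.tail = FALSE)

# Right way: average over our uncertainty about rho_h.
right_way <- sum(posterior * p_ten_plus)

round(c(wrong_way = wrong_way, right_way = right_way), 3)
```

The exact values should land near the simulated 24.5% and 32.8%. Small differences on the right-way number are expected, because the simulation approximates the posterior with a limited number of draws.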
5.7 Working with probability distributions
A probability distribution is not always easy to work with. It is a complex object. And, in many contexts, we don’t really care about all that complexity. So, instead of providing the full probability distribution, we often just use a summary measure, a number or two or three which captures those aspects of the entire distribution which are relevant to the matter at hand. Let’s explore these issues using the 538 posterior probability distribution, as of August 13, 2020, for the number of electoral votes which will be won by Joe Biden. Here is a tibble with 1,000,000 draws from that distribution:
draws
## # A tibble: 1,000,000 x 2
## ID electoral_votes
## <int> <int>
## 1 1 342
## 2 2 297
## 3 3 348
## 4 4 229
## 5 5 407
## 6 6 245
## 7 7 272
## 8 8 253
## 9 9 434
## 10 10 419
## # … with 999,990 more rows
A distribution and a sample of draws from that distribution are different things. But, if you squint, they are sort of the same thing, at least for our purposes. For example, if you want to know the mean of the distribution, then the mean of the draws will be a fairly good estimate, especially if the number of draws is large enough.
Recall from Chapter 2 how we can draw randomly from specified probability distributions:
rnorm(10)
## [1] -1.271 -0.018 -0.331 -0.066 0.535 0.122 0.446 0.457 0.952 0.280
runif(10)
## [1] 0.50 0.97 0.46 0.28 0.28 0.28 0.47 0.76 0.74 0.10
The elements of these vectors are all “draws” from the specified probability distributions. In most applied situations, our tools will produce draws rather than summary objects. Fortunately, a vector of draws is very easy to work with. Start with summary statistics:
# Recall the mean, median, standard deviation and mad functions.
key_stats <- draws %>%
  summarize(mn = mean(electoral_votes),
            md = median(electoral_votes),
            sd = sd(electoral_votes),
            mad = mad(electoral_votes))
key_stats
## # A tibble: 1 x 4
## mn md sd mad
## <dbl> <dbl> <dbl> <dbl>
## 1 325. 326 86.9 101.
Calculate a 95% interval directly:
quantile(draws$electoral_votes, probs = c(0.025, 0.975))
## 2.5% 97.5%
## 172 483
Approximate the 95% interval in two ways:
c(key_stats$mn - 2 * key_stats$sd,
  key_stats$mn + 2 * key_stats$sd)
## [1] 152 499
c(key_stats$md - 2 * key_stats$mad,
  key_stats$md + 2 * key_stats$mad)
## [1] 124 528
In this case, using the mean and standard deviation produces a 95% interval which is closer to the true interval. In other cases, the median and scaled median absolute deviation will do better. Either approximation is generally “good enough” for most work. But, if you need to know the exact 95% interval, you must use quantile().
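The three calculations can be tried on any vector of draws. The sketch below uses a made-up stand-in, `fake_draws`, drawn from a normal distribution; it is not the 538 data, and the mean and standard deviation are chosen only to be on a similar scale.

```r
set.seed(13)

# Hypothetical stand-in for a vector of posterior draws (not the 538 data).
fake_draws <- rnorm(100000, mean = 325, sd = 87)

# Exact 95% interval from the draws themselves.
exact <- quantile(fake_draws, probs = c(0.025, 0.975))

# Two approximations: center plus or minus two measures of spread.
approx_sd <- mean(fake_draws) + c(-2, 2) * sd(fake_draws)
approx_mad <- median(fake_draws) + c(-2, 2) * mad(fake_draws)

round(rbind(exact, approx_sd, approx_mad))
```

With bell-shaped draws like these, all three rows should be nearly identical. With skewed or lumpy draws, such as the electoral vote distribution above, the approximations can drift away from the `quantile()` result.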
5.8 Cardinal Virtues
The four Cardinal Virtues are Wisdom, Justice, Courage, and Temperance. Because data science is, ultimately, a moral act, we use these virtues to guide our work. Every data science project begins with a question.
Wisdom starts by creating the Preceptor Table. What data, if we had it, would allow us to answer our question easily? If the Preceptor Table has one outcome, then the model is predictive. If it has more than one (potential) outcome, then the model is causal. We then explore the data we have. You can never look too closely at your data. Key question: Are the data we have close enough to the data we want (i.e., the Preceptor Table) that we can consider both as coming from the same population? If not, we can’t proceed further.
Justice starts with the Population Table: the data we want to have, the data which we actually have, and all the other data from that same population. Each row of the Population Table is defined by a unique Unit/Time combination. We explore three key issues about the Population Table. First, do the columns demonstrate validity, i.e., is the meaning consistent across the rows? Second, does the relationship among the variables demonstrate stability, i.e., is the model stable across different time periods? Third, are the rows associated with the data representative of all the units which we might have had data for? Justice concludes by making an assumption about the data generating mechanism. What general mathematical formula connects the outcome variable we are interested in with the other data that we have?
Courage allows us to explore different models. Even though Justice has provided the basic mathematical structure of the model, we still need to decide which variables to include and to estimate the values of unknown parameters. We avoid hypothesis tests. We check our models for consistency with the data we have. We select one model.
Temperance guides us in the use of the model we have created to answer the questions we began with. We create posteriors of quantities of interest. We should be modest in the claims we make. The posteriors we create are never the “truth.” The assumptions we made to create the model are never perfect. Yet decisions made with flawed posteriors are almost always better than decisions made without them.
5.8.1 Wisdom

FIGURE 5.4: Wisdom.
Wisdom helps us decide if we can even hope to answer our question with the data we have.
First, start with the Preceptor Table. What rows and columns of data do you need such that, if you had them all, the calculation of the quantity of interest would be trivial? If you want to know the average height of an adult in India, then the Preceptor Table would include a row for each adult and a column for their height. With no missing data, the average is easy to determine, as are a wide variety of other estimands, other unknown numbers.
One key aspect of this Preceptor Table is whether or not we need more than one potential outcome in order to calculate our estimand. For example, if we want to know the causal effect of exposure to Spanish-speakers on attitude toward immigration then we need a causal model, one which estimates that attitude for each person under both treatment and control. The Preceptor Table would require two columns for the outcome. If, on the other hand, we only want to predict someone’s attitude, or compare one person’s attitude to another’s, then we would only need a Preceptor Table with one column for the outcome.
Are we modeling (just) for prediction or are we (also) modeling for causation? Predictive models care nothing about causation. Causal models are often also concerned with prediction, if only as a means of measuring the quality of the model.
Every model is predictive, in the sense that, if we give you new data — and it is drawn from the same population — then you can create a predictive forecast. But only a subset of those models are causal, meaning that, for a given individual, you can change the value of one input and figure out what the new output would be and then, from that, calculate the causal effect by looking at the difference between two potential outcomes.
With prediction, all we care about is forecasting Y given X on some as-yet-unseen data. But there is no notion of “manipulation” in such models. We don’t pretend that, for Joe, we could turn variable X from a value of 5 to a value of 6 by just turning some knob and, by doing so, cause Joe’s value of Y to change from 17 to 23. We can compare two people (or two groups of people), one with X equal to 5 and one with X equal to 6, and see how they differ in Y. The basic assumption of predictive models is that there is only one possible Y for Joe. There are not, by assumption, two possible values for Y, one if X equals 5 and another if X equals 6. The Preceptor Table has a single column under Y.
With causal inference, however, we can consider the case of Joe with \(X = 5\) and Joe with \(X = 6\). The same mathematical model can be used. And both models can be used for prediction, for estimating what the value of Y will be for a yet-unseen observation with a specified value for X. But, in this case, instead of only a single column in the Preceptor Table for Y, we have at least two (and possibly many) such columns, one for each of the potential outcomes under consideration.
The difference between prediction models and causal models is that the former have one column for the outcome variable and the latter have more than one.
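A small sketch may make the one-column versus two-column distinction concrete. Joe's values (17 and 23) come from the discussion above; the second person, "Ann", her value of Y, and all column names are made up for illustration. Base R data frames are used only to keep the sketch free of dependencies.

```r
# Predictive Preceptor Table: a single outcome column. We can compare
# different people, but each person has only one possible y.
predictive <- data.frame(
  person = c("Joe", "Ann"),   # "Ann" is hypothetical
  x = c(5, 6),
  y = c(17, 21)               # Ann's 21 is made up
)

# Causal Preceptor Table: one outcome column per potential outcome.
# In real data at most one of these would be observed for Joe.
causal <- data.frame(
  person = "Joe",
  y_if_x_is_5 = 17,
  y_if_x_is_6 = 23
)

# With both potential outcomes in hand, the causal effect is a subtraction.
causal$causal_effect <- causal$y_if_x_is_6 - causal$y_if_x_is_5

predictive
causal
```

In the predictive table, the difference between Joe and Ann is only a comparison between two people. In the causal table, the difference of 6 is the effect, for Joe, of moving X from 5 to 6.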
Second, we look at the data we have and perform an exploratory data analysis, an EDA. You can never look at your data too much. The most important variable is the one we most want to understand/explain/predict. In the models we create in later chapters, this variable will go on the left-hand side of our mathematical equations. Some academic fields refer to this as the “dependent variable.” Others use terms like “regressand” or “outcome.” Whatever the terminology, we need to explore the distribution of this variable, its min/max/range, its mean and median, its standard deviation, and so on.
Gelman, Hill, and Vehtari (2020) write:
Most important is that the data you are analyzing should map to the research question you are trying to answer. This sounds obvious but is often overlooked or ignored because it can be inconvenient. Optimally, this means that the outcome measure should accurately reflect the phenomenon of interest, the model should include all relevant predictors, and the model should generalize to the cases to which it will be applied.
For example, with regard to the outcome variable, a model of incomes will not necessarily tell you about patterns of total assets. A model of test scores will not necessarily tell you about child intelligence or cognitive development. …
We care about other variables as well, especially those that are most correlated/connected with the outcome variable. The more time that we spend looking at these variables, the more likely we are to create a useful model.
Third, a key concept is the “population.” We need the data we want — the Preceptor Table — and the data we have to be similar enough that we can consider them as all having come from the same statistical population. From Wikipedia:
In statistics, a population is a set of similar items or events which is of interest for some question or experiment. A statistical population can be a group of existing objects (e.g. the set of all stars within the Milky Way galaxy) or a hypothetical and potentially infinite group of objects conceived as a generalization from experience (e.g. the set of all opening hands in all the poker games in Las Vegas tomorrow).
If we assume that the data we have is drawn from the same population as the data in the Preceptor Table, then we can use information about the former to make inferences about the latter. We can combine the Preceptor Table and the data into a single Population Table. If we can’t do that, if we can’t assume that the two sources come from the same population, then we can’t use our data to answer our questions. We have no choice but to walk away. The heart of Wisdom is knowing when to walk away. As John Tukey noted:
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.
5.8.2 Justice

FIGURE 5.5: Justice.
After Wisdom, we have a Population Table. It includes rows for the data we have and the data we want to have. It has missing values, most importantly for potential outcomes which were not observed. The central problem in inference is to fill in the question marks in the Population Table.
There are three key issues to explore in any Population Table: validity, stability and representativeness.
- Validity focuses on the columns of the Population Table. We should confirm that the data we have accurately captures the concepts we care about. Is the data valid, given the problem we are trying to solve?
- Stability assumes that the relationship between the outcome variable and the covariates is consistent over time. Never forget the temporal nature of almost all real data science problems. Our Preceptor Table will focus on rows for today or for the near future. The data we have will always be from before today. We must almost always assume that the future will be like the past in order to use data from the past to make predictions about the future.
- Representativeness is a two-sided concern. We want the data we have to be representative of the population for which we need to calculate parameters. Ideally, we would love for our data to be randomly sampled from the population, but this is almost never the case. This is a concern not just for our data but also for our Preceptor Table. If the data we want is not representative of the entire population, then we will need to be careful in the inferences which we draw.
Validity is about the columns in our Population Table. Stability and representativeness are about the rows.
The last step of Justice is to make an assumption about the structure of the data generating mechanism (DGM): the mathematical formula, and associated error term, which relates our outcome variable to our covariates.
Justice requires math. Consider a model of coin-tossing:
\[ H_i \sim B(\rho_h, n = 20) \]
The total number of Heads in experiment \(i\), \(H_i\), is distributed as a binomial with \(n = 20\) flips of a single coin and an unknown probability \(\rho_h\) of the coin coming up Heads.
Note:
- This is a cheat and a simplification! We are Bayesians but we have not specified the full Bayesian machinery. We really need priors on the unknown parameter \(\rho_h\) as well. But that is too complex for an introductory class, so we wave our hands, accept the default sensible parameters built into the R packages we use and point readers to more advanced books, like Gelman, Hill, and Vehtari (2020).
- Defining \(\rho_h\) as “the probability that the coin comes up Heads” is a bit of a fudge. If you calculate that by hand and then compare it to what our tools produce, they won’t be the same. Instead, the calculated value will be closer to zero. Why? \(\rho_h\) is really the “long-run percentage of the time the coin comes up Heads.” It is not just the percentage from this experiment.
- In this simple case, we are fortunate that the parameter \(\rho_h\) has such a (mostly!) simple analog to a real world quantity. Most of the time, parameters are not so easy to interpret. In a more complex model, especially one with interaction terms, we focus less on parameters and more on actual predictions.
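The coin-tossing model above can be turned into a small simulation. The sketch below assumes a "true" value of 0.4 for \(\rho_h\), picked purely for illustration, and contrasts the share of heads in a single experiment of 20 flips with the long-run share across many experiments.

```r
set.seed(5)

# Hypothetical "true" value of the parameter, chosen only for illustration.
rho_h <- 0.4

# One experiment: the number of heads in 20 flips.
one_experiment <- rbinom(n = 1, size = 20, prob = rho_h)

# Many experiments drawn from the same data generating mechanism.
many_experiments <- rbinom(n = 100000, size = 20, prob = rho_h)

# Share of heads in the single experiment versus the long-run share.
one_experiment / 20
mean(many_experiments) / 20
```

The single-experiment share can easily miss 0.4, while the long-run share should sit very close to it. That is the sense in which \(\rho_h\) describes the long run rather than any one experiment.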
5.8.3 Courage

FIGURE 5.6: Courage.
The three languages of data science are words, math and code, and the most important of these is code. We need to explain the structure of our model using all three languages, but we need Courage to implement the model in code.
Courage requires us to take the general mathematical formula provided by Justice and then make it specific. Which variables should we include in the model and which do we exclude? Every data science project involves the creation of several models. For each, we specify the precise data generating mechanism. Using that formula, and some R code, we create a fitted model. All models have parameters. We can never know the true values of the parameters, but we can create, and explore, posterior distributions for those unknown true values.
Code allows us to “fit” a model by estimating the values of the unknown parameters, like \(\rho_h\). Sadly, we can never know the true values of these parameters. But, like all good data scientists, we can express our uncertain knowledge in the form of posterior probability distributions. With those distributions, we can compare the actual values of the outcome variable with the “fitted” or “predicted” results of the model. We can examine the “residuals,” the difference between the fitted and actual values.
Every outcome is the sum of two parts: the model and what is not in the model:
\[ \text{outcome} = \text{model} + \text{what is not in the model} \]
It doesn’t matter what the outcome is. It could be the result of a coin flip, the weight of a person, the GDP of a country. Whatever outcome we are considering is always made up of two parts. The first is the model we have created. The second is all the stuff — all the blooming and buzzing complexity of the real world — which is not a part of the model.
Some of our uncertainty is driven by our ignorance about \(\rho_h\).
A parameter is something which does not exist in the real world. (If it did, or could, then it would be data.) Instead, a parameter is a mental abstraction, a building block which we will use to help us accomplish our true goal: To replace at least some of the question marks in the actual Preceptor Table. Since parameters are mental abstractions, we will always be uncertain as to their value, however much data we might collect.
But some, often most, of the uncertainty comes from forces that are, by assumption, not in the model. For example, if the coin is fair, we expect \(H_i\) to equal 10. But, often, it will be different, even if we are correct and \(\rho_h\) equals exactly 0.5.
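How often does a fair coin give something other than 10 heads in 20 tosses? A quick sketch with the exact binomial probabilities answers the question, and a few simulated experiments show the part of each outcome which the model does not capture. Object names are ours.

```r
# Exact probability of getting exactly 10 heads in 20 tosses of a fair coin.
p_exactly_ten <- dbinom(10, size = 20, prob = 0.5)
p_exactly_ten          # about 0.176

# So most experiments differ from the expected value of 10.
1 - p_exactly_ten

# A few simulated experiments and what the model leaves unexplained.
set.seed(7)
heads <- rbinom(n = 10, size = 20, prob = 0.5)
expected <- 20 * 0.5
data.frame(heads, expected, not_in_model = heads - expected)
```

Even with \(\rho_h\) known to be exactly 0.5, the outcome misses 10 more than four times out of five.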
Some randomness is intrinsic in this fallen world.
5.8.4 Temperance

FIGURE 5.7: Temperance.
There are few more important concepts in statistics and data science than the “Data Generating Mechanism.” Our data — the data that we collect and see — has been generated by the complexity and confusion of the world. God’s own mechanism has brought His data to us. Our job is to build a model of that process, to create, on the computer, a mechanism which generates fake data consistent with the data which we see. With that DGM, we can answer any question which we might have. In particular, with the DGM, we provide predictions of data we have not seen and estimates of the uncertainty associated with those predictions. Justice gave us the structure of the DGM. Courage created the DGM, the fitted model. Temperance will guide us in its use.
Having created (and checked) a model, we now use the model to answer questions. Models are made for use, not for beauty. The world confronts us. Make decisions we must. Our decisions will be better ones if we use high quality models to help make them.
Sadly, our models are never as good as we would like them to be. First, the world is intrinsically uncertain.

FIGURE 5.8: Donald Rumsfeld.
There are known knowns. There are things we know we know. We also know there are known unknowns. That is to say, we know there are some things we do not know. But there are also unknown unknowns, the ones we do not know we do not know. – Donald Rumsfeld
What we really care about is data we haven’t seen yet, mostly data from tomorrow. But what if the world changes, as it always does? If it doesn’t change much, maybe we are OK. If it changes a lot, then what good will our model be? In general, the world changes some. That means that our forecasts are more uncertain than a naive use of our model might suggest.

FIGURE 5.9: Three Card Monte.
What does this mean? Well, imagine a crowd playing Three Card Monte in the streets of New York. The guy running the game runs a demo and shows you all the cards to make you confident. He earns money by making you overconfident and persuading you to bet. Your odds may seem good during the demo round, but that doesn’t actually say anything about what will likely happen when the real, high-stakes game begins. The person running the game does many simulations, making the “victim” forget that they cannot actually draw any conclusions about the odds of winning. There are some variables that we simply do not know, even if we put a lot of effort into making posterior probability distributions. People can be using sleight of hand, for instance.
We need patience in order to study and understand the unknown unknowns in our data. Patience is also important when we analyze the “realism” of our models. When we created the mathematical probability distribution for presidential elections, for instance, we assumed that the Democratic candidate would have a 50% chance of winning each vote in the electoral college. By comparing the mathematical model to our empirical cases, however, we recognize that the mathematical model is unlikely to be true. The mathematical model suggested that getting fewer than 100 votes is next to impossible, but many past Democratic candidates in the empirical distribution received fewer than 100 electoral votes.
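To see how strongly that mathematical model rules out a poor showing, here is a rough check. It assumes the model treats each of the 538 electoral votes as an independent coin flip with probability 0.5, so that the total is binomial; that reading of the model, and the object names, are assumptions made for this sketch.

```r
# Assumed mathematical model: each of 538 electoral votes is an independent
# 50/50 coin flip, so the total is Binomial(538, 0.5).
total_votes <- 538

# Probability of fewer than 100 electoral votes under that model.
p_under_100 <- pbinom(99, size = total_votes, prob = 0.5)
p_under_100

# The range which the model considers plausible (middle 95%).
plausible <- qbinom(c(0.025, 0.975), size = total_votes, prob = 0.5)
plausible
```

Under these assumptions the probability is vanishingly small, and the plausible range is a narrow band around 269. Real elections fall outside that band regularly, which is the mismatch described above.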
In Temperance, the key distinction is between the true posterior distribution, which we will call “Preceptor’s Posterior,” and the estimated posterior distribution. Recall our discussion from Section 5.1. Imagine that every assumption we made in Wisdom and Justice were correct, that we correctly understand every aspect of how the world works. We still would not know the unknown value we are trying to estimate (recall the Fundamental Problem of Causal Inference), but the posterior we created would be perfect. That is Preceptor’s Posterior. Sadly, even if our estimated posterior is very close to Preceptor’s Posterior, we can never be sure of that fact, because we can never know the truth, never be certain that all the assumptions we made are correct.
Even worse, we must always worry that our estimated posterior, despite all the work we put into creating it, is far from the truth. We, therefore, must be cautious in our use of that posterior, humble in our claims about its accuracy. Using our posterior, despite its flaws, is better than not using it. Yet it is, at best, a distorted map of reality, a glass through which we must look darkly. Use your posterior with humility.
5.9 Summary
Throughout this chapter, we spent time going through examples of conditional distributions. However, it’s worth noting that all probability distributions are conditional on something. Even in the simplest examples, when we were flipping a coin multiple times, we were assuming that the probability of getting heads versus tails did not change between tosses.
We also discussed the difference between empirical, mathematical, and posterior probability distributions. Even though we developed these heuristics to better understand distributions, every time we make a claim about the world, it is based on our beliefs, on what we think about the world. We could be wrong. Our beliefs can differ. Two reasonable people can have conflicting beliefs about the fairness of a die.
At the start of this chapter we briefly discussed the definition of a random variable. We then let it go for the rest of the chapter, but it is hiding almost everywhere, whenever we create a distribution. For example, in the two-model case we had two random variables, having the disease and not having the disease. In the three-model case we had three random variables, corresponding to 0, 1, or 2 white marbles in the bag. Random variables are essentially like the missing values in the Preceptor Table: which random variables do you need to know the values of in order to answer the question?
It is useful to understand the three types of distributions and the concept of conditional distributions, but almost every probability distribution is conditional and posterior. We can leave out both words in future discussions, as we generally will in this book. They are implicit.
If you are keen to learn more about probability, here is a video featuring Professor Gary King. This is a great way to review some of the concepts we covered in this chapter, albeit at a higher level of mathematics.